Exploratory Data Analysis

Exploring Coal Consumption Data

Before building any models, it is necessary to explore the components of the coal consumption time series. Understanding how each component contributes to the series as a whole is crucial, so that later models can take these components into account. Time series data primarily consist of four components: level, trend, seasonality, and noise. Level, trend, and seasonality are systematic components, whereas noise is nonsystematic and unpredictable. Breaking a time series down into these parts leads to a better understanding of it.
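To make these components concrete, here is a minimal base-R sketch (using simulated data, not the coal series) that builds a quarterly series out of the four components and then recovers them with decompose():

```r
# Hypothetical illustration: assemble a toy quarterly series from
# level + trend + seasonality + noise, then recover the pieces.
set.seed(42)
n <- 40                                    # 10 years of quarterly data
level  <- 100
trend  <- 0.5 * seq_len(n)                 # slow upward drift
season <- rep(c(10, -5, -10, 5), n / 4)    # repeating quarterly pattern
noise  <- rnorm(n, sd = 2)                 # nonsystematic component

y <- ts(level + trend + season + noise, start = c(2001, 1), frequency = 4)
parts <- decompose(y)                      # additive decomposition
head(parts$seasonal, 4)                    # recovered quarterly effects
```

The recovered seasonal effects land close to the (10, -5, -10, 5) pattern that was built in, which is exactly the kind of structure the decomposition plots below are meant to expose.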

In this exploratory data analysis, the plots examine a single time series: quarterly total U.S. coal consumption. Although the collected data covers all 50 states (and that state-level information will be used in later analysis), it makes sense to scope this analysis to the national total for simplicity, especially since the states all follow the same overall trend in coal consumption.


Time Series Plot

Firstly, it is important to visualize the time series at its most basic level, which is a simple plot of data over time.

Code
coal_df <- read.csv("/Users/raezh1/Documents/Georgetown/ANLY560/website/Time Series/data/coal_us_consumption.csv") |>
  group_by(period) |>
  summarise(Consumption = sum(consumption)) |>
  filter(!period %in% c('2000-Q1', '2000-Q2',  '2000-Q3', '2000-Q4'))

coal_df_ts <- ts(coal_df$Consumption, start = c(2001,1), end = c(2021,1), frequency = 4)
Code
theme_set(theme_gray(base_size=12,base_family="Palatino"))
autoplotly(coal_df_ts) +
  ggtitle("U.S. Quarterly Coal Consumption") +
  xlab("Year (2001-2021)") +
  ylab("Short Ton")


Decomposition

A multiplicative trend means the trend is not linear (a curved line), and multiplicative seasonality means the widths or heights of the seasonal periods change over time. Given the time series plot above, we should use the multiplicative model in decompose() to decompose the coal consumption series.
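The relationship between the two model forms can be sketched with the built-in AirPassengers series (used here purely for illustration, not the coal data): a multiplicative decomposition of a series corresponds to an additive decomposition of its logarithm.

```r
# Multiplicative model: y_t = T_t * S_t * R_t
# Additive model:       y_t = T_t + S_t + R_t
# Taking logs turns the multiplicative model into an additive one:
#   log(y_t) = log(T_t) + log(S_t) + log(R_t)
mult <- decompose(AirPassengers, type = "multiplicative")
addl <- decompose(log(AirPassengers), type = "additive")

# The log of the multiplicative seasonal factors closely tracks the
# additive seasonal component of the logged series:
cor(log(mult$seasonal), addl$seasonal)
```

This is why log transforms appear again in the differencing section below: logging a multiplicative series lets additive tools work on it.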

Code
decompose_coal <- decompose(coal_df_ts, "multiplicative")
autoplotly(decompose_coal) +
  ggtitle("Decomposition of U.S. Total Coal Consumption")

In R, the stl() function decomposes a time series into seasonal, trend, and irregular components using loess. stl() handles any type of seasonality, not only monthly and quarterly data.

Code
coal_df_ts |>
  stl(s.window="periodic", robust=TRUE) |>
  autoplotly()

The two decomposition plots are very similar, except that the remainder from the stl() function is less smooth and noisier than the remainder from the decompose() function. Both show that the U.S. quarterly coal consumption data has seasonality. Before roughly 2010 the trend is increasing; after 2010 it is decreasing.
Since the data has a trend, it is not stationary: a stationary time series is one whose properties do not depend on the time at which the series is observed. The ACF plot below will make this non-stationarity clearer.

Lag Plots

A lag plot is a type of scatter plot in which a time series is plotted against a lagged copy of itself, some number of time units behind or ahead. Lag plots often reveal information about the seasonality of the data, whether the data are random, and whether autocorrelation is present. Below is a lag plot for the quarterly U.S. coal consumption data.
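Under the hood, a lag plot is nothing more than the series plotted against a lagged copy of itself. A minimal base-R sketch, using a simulated quarterly series as a stand-in for the coal data:

```r
# Simulated quarterly series with a strong seasonal pattern
# (a stand-in for coal_df_ts, for illustration only).
set.seed(1)
y <- 10 + rep(c(3, -1, -3, 1), 20) + rnorm(80, sd = 0.5)

k <- 4                         # one-year lag for quarterly data
x_lagged <- head(y, -k)        # y[t - k]
x_now    <- tail(y, -k)        # y[t]

plot(x_lagged, x_now, xlab = "y[t-4]", ylab = "y[t]",
     main = "Manual lag-4 plot")
cor(x_lagged, x_now)           # strong seasonal autocorrelation
```

gglagplot() draws the same pairs for many lags at once and colors them by season.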

Code
library(wesanderson)
gglagplot(coal_df_ts, do.lines=FALSE, set.lags = c(4,8,12,16,20,24,28,32,36,40,44,48,52,56,60,64)) + 
  xlab("Lags") + 
  ggtitle("Lag Plot of U.S. Total Coal Consumption") +
  theme_minimal() +
  ylab("Coal Consumption in short tons") +
  theme_light() +
  theme(text=element_text(size=12, family="Palatino")) +
  labs(fill = "Legend") +
  scale_color_brewer(palette="Set2") +
  theme(axis.text.x=element_text(angle=90, hjust=1))

From the lag plots we can see that the coal consumption data has serial correlation: the lag plots show a linear pattern, which indicates that autocorrelation is present. The pattern is a positive linear trend, so the data has positive autocorrelation. Seasonality is harder to spot here, partly because the chosen lags are all multiples of 4 (the seasonal period); the ACF plot below gives a clearer view of the seasonal structure.


Autocorrelation and Stationarity

The ACF function computes (and by default plots) an estimate of the autocorrelation function of a univariate time series; the PACF function does the same for the partial autocorrelation function.

Autocorrelation and partial autocorrelation plots are heavily used in time series analysis and forecasting.

Code
acf <- ggAcf(coal_df_ts, 80) + ggtitle("ACF Plot of U.S. Total Coal Consumption")
ggplotly(acf)

Note that the ACF shows an oscillation, indicative of a seasonal series: the peaks occur at lags 4, 8, 12, and so on. Since the data is quarterly, this means the data has yearly seasonality.

A stationary time series is one whose properties do not depend on the time at which the series is observed. Thus, time series with trends, or with seasonality, are not stationary — the trend and seasonality will affect the value of the time series at different times. On the other hand, a white noise series is stationary — it does not matter when you observe it, it should look much the same at any point in time.

As well as looking at the time plot of the data, the ACF plot is also useful for identifying non-stationary time series. For a stationary time series, the ACF will drop to zero relatively quickly, while the ACF of non-stationary data decreases slowly. Our ACF graph decreases slowly and we can clearly see the data is not stationary.
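This contrast is easy to demonstrate with simulated data: the ACF of white noise dies out immediately, while the ACF of a trending series decays slowly. A base-R sketch:

```r
# White noise vs. a trending series: how their ACFs differ.
set.seed(7)
wn     <- rnorm(500)                           # white noise: stationary
trendy <- cumsum(rnorm(500)) + 0.5 * (1:500)   # random walk with drift

acf_wn    <- acf(wn, plot = FALSE)$acf
acf_trend <- acf(trendy, plot = FALSE)$acf

acf_wn[2]      # lag-1 autocorrelation of white noise: near zero
acf_trend[2]   # lag-1 autocorrelation of the trending series: near one
```

The coal series behaves like the second case, which is why its ACF decays so slowly.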

Code
pacf <- ggPacf(coal_df_ts, 50) + ggtitle("PACF Plot of U.S. Total Coal Consumption")
ggplotly(pacf)

From the PACF plot, we can see a large spike at lag 1 that dies out after a few lags. Combined with the slowly decaying ACF, a large lag-1 spike that cuts off quickly is characteristic of an autoregressive (AR) term in the data.
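The lag-1 spike pattern can be checked against a simulated AR(1) series, which produces the textbook PACF shape: one large partial autocorrelation at lag 1 that then cuts off. A base-R sketch:

```r
# Simulate an AR(1) process, y_t = 0.8 * y_{t-1} + e_t, and inspect
# its partial autocorrelations.
set.seed(5)
ar1 <- arima.sim(model = list(ar = 0.8), n = 300)

p <- pacf(ar1, plot = FALSE)$acf
p[1]                 # large spike at lag 1, near the true 0.8
max(abs(p[2:10]))    # later partial autocorrelations are small
```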


Differencing and Detrending

Differences and Seasonal Differences Estimation with Stationarity and Seasonality Tests

The Augmented Dickey-Fuller test is a type of statistical test called a unit root test.

ADF test is conducted with the following assumptions.

Null Hypothesis (H0): the series is non-stationary, i.e., it has a unit root.

Alternative Hypothesis (HA): the series is stationary, i.e., it has no unit root.

If we fail to reject the null hypothesis, the test provides evidence that the series is non-stationary. We reject H0 when the p-value is below the chosen significance level (here, 0.05).

Code
adf.test(coal_df_ts |> log())

    Augmented Dickey-Fuller Test

data:  log(coal_df_ts)
Dickey-Fuller = -2.9525, Lag order = 4, p-value = 0.1858
alternative hypothesis: stationary

The p-value of the ADF test is greater than 0.05, so we cannot reject the null hypothesis, meaning the U.S. total coal consumption data is not stationary. This test result matches the conclusion we drew from the decomposition and ACF plots.

Let’s take the first differencing and do the ADF test again.

Code
adf.test(coal_df_ts |> log() |> diff())
Warning in adf.test(diff(log(coal_df_ts))): p-value smaller than printed p-value

    Augmented Dickey-Fuller Test

data:  diff(log(coal_df_ts))
Dickey-Fuller = -4.3984, Lag order = 4, p-value = 0.01
alternative hypothesis: stationary

We can see that after first differencing the p-value is below 0.05, which means the differenced series is stationary and a second differencing is not needed.

The KPSS test works in the opposite direction: here the null hypothesis is that the data are stationary, and we look for evidence that the null hypothesis is false. Consequently, small p-values (or, equivalently, large test statistics) suggest that differencing is required.

Code
library(urca)
coal_df_ts |> ur.kpss() |> summary()

####################### 
# KPSS Unit Root Test # 
####################### 

Test is of type: mu with 3 lags. 

Value of test-statistic is: 1.7562 

Critical value for a significance level of: 
                10pct  5pct 2.5pct  1pct
critical values 0.347 0.463  0.574 0.739

The test statistic is much bigger than the 5% critical value, indicating that the null hypothesis is rejected. That is, the data are not stationary. We can difference the data, and apply the test again.

Code
coal_df_ts |> log() |> diff(lag=4) |> diff() |> ur.kpss() |> summary()

####################### 
# KPSS Unit Root Test # 
####################### 

Test is of type: mu with 3 lags. 

Value of test-statistic is: 0.0871 

Critical value for a significance level of: 
                10pct  5pct 2.5pct  1pct
critical values 0.347 0.463  0.574 0.739

This time, the test statistic is tiny, and well within the range we would expect for stationary data, so we can conclude that the differenced data are stationary.

A related function for determining whether seasonal differencing is required is nsdiffs(), which uses a measure of seasonal strength to determine the appropriate number of seasonal differences, while ndiffs() estimates the number of first differences required. If the seasonal strength is less than 0.64, no seasonal difference is suggested; otherwise one seasonal difference is suggested.
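The seasonal-strength measure behind nsdiffs() can be sketched by hand. Assuming the definition FS = max(0, 1 - Var(remainder) / Var(seasonal + remainder)) from Hyndman's time series feature set, a strongly seasonal simulated series scores well above the 0.64 threshold:

```r
# Hedged sketch of the seasonal-strength measure used by nsdiffs():
#   F_S = max(0, 1 - Var(remainder) / Var(seasonal + remainder))
set.seed(3)
y <- ts(rep(c(8, -2, -8, 2), 25) + rnorm(100, sd = 1), frequency = 4)

dec      <- decompose(y)
r        <- na.omit(dec$random)                # remainder component
s_plus_r <- na.omit(dec$seasonal + dec$random) # detrended series

Fs <- max(0, 1 - var(r) / var(s_plus_r))
Fs > 0.64   # TRUE: one seasonal difference would be suggested
```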

We can apply nsdiffs() to the logged U.S. Total Coal Consumption data.

Code
coal_df_ts |> log() |> ndiffs()
[1] 1
Code
coal_df_ts |> log() |> diff() |> nsdiffs()
[1] 1
Code
coal_df_ts |> log() |> diff(lag=4) |> nsdiffs()
[1] 0

ndiffs() on the logged data returns 1 first difference, and nsdiffs() on the first-differenced data returns 1 (indicating one seasonal difference is required); applying nsdiffs() to the seasonally differenced data returns 0. Together, the two functions suggest taking both a seasonal difference and a first difference.

In conclusion, after trying several differences and seasonal differences combinations, taking one seasonal difference and one difference is what the data needs to be stationary.

Differencing Results

Transformations such as logarithms can help to stabilise the variance of a time series. Differencing can help stabilise the mean of a time series by removing changes in its level, thereby eliminating (or reducing) trend and seasonality.
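A small simulated example (base R, not the coal series) makes both effects visible: a log transform flattens variance that grows with the level, and a first difference of the logs leaves roughly the constant growth rate:

```r
# Simulated series with multiplicative (exponential) growth.
set.seed(11)
y <- ts(exp(0.05 * (1:100) + rnorm(100, sd = 0.1)), frequency = 4)

# Raw series: variance grows with the level.
v_raw <- c(var(head(y, 50)), var(tail(y, 50)))
# Logged series: variance of the two halves is comparable.
v_log <- c(var(head(log(y), 50)), var(tail(log(y), 50)))

v_raw               # second half far more variable than the first
mean(diff(log(y)))  # close to the 0.05 per-period growth rate
```

The same log-then-difference recipe is what the cbind() comparison below applies to the coal data.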

Code
cbind("Original" = coal_df_ts,
      "Log Trans" = log(coal_df_ts),
      "Sea. Diff" = diff(log(coal_df_ts),lag=4),
      "1st & Sea. Diff" = diff(diff(log(coal_df_ts),lag=4))) |>
  autoplotly(facets=TRUE) +
    xlab("Year") + ylab("") +
    ggtitle("Differencing of U.S. Total Coal Consumption") 

New Stationary Data

Finally, the trend and seasonality should be removed to make the time series stationary. A comparison of all of the transformations is shown below; once they are applied, the series is stationary, as can also be seen in the ACF plots below.

Code
`Log Transformation` <- log(coal_df_ts)
`Remove Seasonality` <- diff(log(coal_df_ts), lag=4)
`First Diff After Remove Seasonality` <- diff(diff(log(coal_df_ts), lag=4))

a <- ggAcf(coal_df_ts,70) + ggtitle("Original Data")
b <- ggAcf(`Log Transformation`,70) + ggtitle("Log Transformation")
c <- ggAcf(`Remove Seasonality`,70) + ggtitle("Remove Seasonality")
d <- ggAcf(`First Diff After Remove Seasonality`,70) + ggtitle("First Diff After Remove Seasonality")

ACF Plots

Code
ggarrange(a, b, ncol = 1, nrow = 2)

Code
ggarrange(c, d, ncol = 1, nrow = 2)

The first three ACF plots above are noticeably less ideal than the last one, First Diff After Remove Seasonality.

For white noise series and stationary series, we expect each autocorrelation to be close to zero. Of course, they will not be exactly equal to zero as there is some random variation. For a white noise series, we expect 95% of the spikes in the ACF to lie within ±2/√T where T is the length of the time series. It is common to plot these bounds on a graph of the ACF (the blue dashed lines above). If one or more large spikes are outside these bounds, or if substantially more than 5% of spikes are outside these bounds, then the series is probably not white noise, and it’s not stationary either.
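For reference, the bound is easy to compute for this series. Assuming coal_df_ts holds T = 81 quarterly observations (2001 Q1 through 2021 Q1):

```r
# White-noise bounds drawn on the ACF plots: +/- 2 / sqrt(T).
T_len <- 81                 # assumed length of coal_df_ts
bound <- 2 / sqrt(T_len)
round(bound, 3)             # spikes within about +/- 0.222 are
                            # consistent with white noise
```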

This ACF plot of the detrended and differenced U.S. total coal consumption data shows that almost all of the spikes fall inside the blue bounds, which indicates the transformed data is stationary and ready to be used in further analysis.

Exploring U.S. CO2 Emissions Data

Time Series Plot

Code
emissions_df <- read.csv("/Users/raezh1/Documents/Georgetown/ANLY560/website/Time Series/data/df_total_monthly_CO2_emissions.csv", skip=1, row.names = 1)

emissions_ts <- ts(emissions_df$sum, start = c(1973,1), end = c(2022,12), frequency = 12)

autoplotly(emissions_ts) +
  xlab("Year") + ylab("Million Metric Tons of Carbon Dioxide") +
  ggtitle("U.S. Monthly CO2 Emissions")


Decomposition

A multiplicative trend means the trend is not linear (a curved line), and multiplicative seasonality means the widths or heights of the seasonal periods change over time. Given the time series plot above, we should use the multiplicative model in decompose() to decompose the CO2 emissions series.

Code
decompose_emissions <- decompose(emissions_ts, "multiplicative")
autoplotly(decompose_emissions) +
  ggtitle("Decomposition of U.S. Total CO2 Emissions")

In R, the stl() function decomposes a time series into seasonal, trend, and irregular components using loess. stl() handles any type of seasonality, not only monthly and quarterly data.

Code
emissions_ts |>
  stl(s.window="periodic", robust=TRUE) |>
  autoplotly() +
  ggtitle("Decomposition of U.S. Total CO2 Emissions")

The two decomposition plots are very similar, except that the remainder from the stl() function is less smooth and noisier than the remainder from the decompose() function. Both show that the U.S. monthly CO2 emissions data has seasonality. Before roughly 2008 the trend is increasing; after 2008 it is decreasing.
Since the data has a trend, it is not stationary: a stationary time series is one whose properties do not depend on the time at which the series is observed. The ACF plot below will make this non-stationarity clearer.

Lag Plots

A lag plot is a type of scatter plot in which a time series is plotted against a lagged copy of itself, some number of time units behind or ahead. Lag plots often reveal information about the seasonality of the data, whether the data are random, and whether autocorrelation is present. Below is a lag plot for the monthly U.S. CO2 emissions data.

Code
library(wesanderson)
gglagplot(emissions_ts, do.lines=FALSE, set.lags = c(1,12,12*2,12*3,12*4,12*5,12*6,12*7,12*8,12*9,12*10,12*11,12*12,12*13,12*14,12*15)) + 
  xlab("Lags") + 
  ggtitle("Lag Plot of U.S. CO2 Emissions") +
  theme_minimal() +
  ylab("Million Metric Tons of Carbon Dioxide") +
  theme_light() +
  theme(text=element_text(size=12, family="Palatino")) +
  labs(fill = "Legend") +
  #scale_color_brewer(palette="Set2") +
  theme(axis.text.x=element_text(angle=90, hjust=1))

From the lag plots we can see that the CO2 emissions data has serial correlation: the lag plots show a linear pattern, which indicates that autocorrelation is present. The pattern is a positive linear trend, so the data has positive autocorrelation. Seasonality is harder to spot here, partly because, apart from lag 1, the chosen lags are all multiples of 12 (the seasonal period); the ACF plot below gives a clearer view of the seasonal structure.


Autocorrelation and Stationarity

The ACF function computes (and by default plots) an estimate of the autocorrelation function of a univariate time series; the PACF function does the same for the partial autocorrelation function.

Autocorrelation and partial autocorrelation plots are heavily used in time series analysis and forecasting.

Code
acf <- ggAcf(emissions_ts, 80) + ggtitle("ACF Plot of U.S. Total CO2 Emissions")
ggplotly(acf)

Note that the ACF shows an oscillation, indicative of a seasonal series: the peaks occur at lags 12, 24, 36, and so on. Since the data is monthly, this means the data has yearly seasonality.

A stationary time series is one whose properties do not depend on the time at which the series is observed. Thus, time series with trends, or with seasonality, are not stationary — the trend and seasonality will affect the value of the time series at different times. On the other hand, a white noise series is stationary — it does not matter when you observe it, it should look much the same at any point in time.

As well as looking at the time plot of the data, the ACF plot is also useful for identifying non-stationary time series. For a stationary time series, the ACF will drop to zero relatively quickly, while the ACF of non-stationary data decreases slowly. Our ACF graph decreases slowly and we can clearly see the data is not stationary.

Code
pacf <- ggPacf(emissions_ts, 80) + ggtitle("PACF Plot of U.S. Total CO2 Emissions")
ggplotly(pacf)

From the PACF plot, we can see a large spike at lag 1, with the remaining significant spikes dying out after about 13 lags. Combined with the slowly decaying ACF, this pattern is characteristic of an autoregressive (AR) term in the data.


Differencing and Detrending

Differences and Seasonal Differences Estimation with Stationarity and Seasonality Tests

As before, we apply the Augmented Dickey-Fuller unit root test. The null hypothesis (H0) is that the series is non-stationary, i.e., it has a unit root; the alternative hypothesis (HA) is that the series is stationary. We reject H0 when the p-value is below the chosen significance level (here, 0.05).

Code
adf.test(emissions_ts |> log())

    Augmented Dickey-Fuller Test

data:  log(emissions_ts)
Dickey-Fuller = -2.4005, Lag order = 8, p-value = 0.4088
alternative hypothesis: stationary

The p-value of the ADF test is greater than 0.05, so we cannot reject the null hypothesis, meaning the U.S. monthly CO2 emissions data is not stationary. This test result matches the conclusion we drew from the decomposition and ACF plots.

Let’s do the ADF test after first differencing.

Code
adf.test(emissions_ts |> log() |> diff())
Warning in adf.test(diff(log(emissions_ts))): p-value smaller than printed
p-value

    Augmented Dickey-Fuller Test

data:  diff(log(emissions_ts))
Dickey-Fuller = -14.873, Lag order = 8, p-value = 0.01
alternative hypothesis: stationary

We can see that after first differencing the p-value is below 0.05, which means the differenced series is stationary and a second differencing is not needed.

The KPSS test works in the opposite direction: here the null hypothesis is that the data are stationary, and we look for evidence that the null hypothesis is false. Consequently, small p-values (or, equivalently, large test statistics) suggest that differencing is required.

Code
library(urca)
emissions_ts |> log() |> ur.kpss() |> summary()

####################### 
# KPSS Unit Root Test # 
####################### 

Test is of type: mu with 6 lags. 

Value of test-statistic is: 3.9813 

Critical value for a significance level of: 
                10pct  5pct 2.5pct  1pct
critical values 0.347 0.463  0.574 0.739

The test statistic is much bigger than the 5% critical value, indicating that the null hypothesis is rejected. That is, the data are not stationary. We can difference the data, and apply the test again.

Code
emissions_ts |> log() |> diff() |> ur.kpss() |> summary()

####################### 
# KPSS Unit Root Test # 
####################### 

Test is of type: mu with 6 lags. 

Value of test-statistic is: 0.0218 

Critical value for a significance level of: 
                10pct  5pct 2.5pct  1pct
critical values 0.347 0.463  0.574 0.739

This time, the test statistic is small and well within the range we would expect for stationary data, so we can conclude that the differenced data are stationary.

A related function for determining whether seasonal differencing is required is nsdiffs(), which uses a measure of seasonal strength to determine the appropriate number of seasonal differences. If the seasonal strength is less than 0.64, no seasonal difference is suggested; otherwise one seasonal difference is suggested.

We can apply nsdiffs() to the logged U.S. Total CO2 Emissions data.

Code
emissions_ts |> log() |> ndiffs()
[1] 1
Code
emissions_ts |> log() |> diff() |> nsdiffs()
[1] 1
Code
emissions_ts |> log() |> diff() |> diff(lag=12) |> nsdiffs()
[1] 0

ndiffs() on the logged data returns 1 first difference, and nsdiffs() on the first-differenced data returns 1 (indicating one seasonal difference is required); applying nsdiffs() again after both differences returns 0.

Here lag=12 means taking a yearly seasonal difference, and the final result of 0 means the series no longer has seasonality once that yearly seasonal difference has been taken.

In conclusion, after trying several combinations of differences and seasonal differences, one seasonal difference plus one first difference is what this data needs to be stationary.

Differencing Results

Code
cbind("Original" = emissions_ts,
      "Log Trans" = log(emissions_ts),
      "Sea. Diff" = diff(log(emissions_ts),lag=12),
      "1st & Sea. Diff" = diff(diff(log(emissions_ts),lag=12))) %>%
  autoplotly(facets=TRUE) +
    xlab("Year") + ylab("") +
    ggtitle("Differencing of U.S. Total CO2 Emissions") 

New Stationary Data

Finally, the trend and seasonality should be removed to make the time series stationary. A comparison of all of the transformations is shown below; once they are applied, the series is stationary, as can also be seen in the ACF plots below.

Code
`Log Transformation` <- log(emissions_ts)
`Remove Seasonality` <- diff(log(emissions_ts), lag=12)
`First Diff After Remove Seasonality` <- diff(diff(log(emissions_ts), lag=12))

a <- ggAcf(emissions_ts,70) + ggtitle("Original Data")
b <- ggAcf(`Log Transformation`,70) + ggtitle("Log Transformation")
c <- ggAcf(`Remove Seasonality`,70) + ggtitle("Remove Seasonality")
d <- ggAcf(`First Diff After Remove Seasonality`,70) + ggtitle("First Diff After Remove Seasonality")

ACF Plots

Code
ggarrange(a, b, ncol = 1, nrow = 2)

Code
ggarrange(c, d, ncol = 1, nrow = 2)

The first three ACF plots above are noticeably less ideal than the last one, First Diff After Remove Seasonality.

For white noise series and stationary series, we expect each autocorrelation to be close to zero. Of course, they will not be exactly equal to zero as there is some random variation. For a white noise series, we expect 95% of the spikes in the ACF to lie within ±2/√T where T is the length of the time series. It is common to plot these bounds on a graph of the ACF (the blue dashed lines above). If one or more large spikes are outside these bounds, or if substantially more than 5% of spikes are outside these bounds, then the series is probably not white noise, and it’s not stationary either.

This ACF plot of the detrended and differenced U.S. total CO2 emissions data shows that almost all of the spikes fall inside the blue bounds, which indicates the transformed data is stationary and ready to be used in further analysis.